An Analysis of the Number of Comparisons in Quicksort¶
$$ \newcommand{\defeq}{\mathrel{\mathop:}\mathrel{\mkern-1.2mu}=} $$
Sorting Algorithms¶
Sorting algorithms are very versatile and appear in many applications. One of the best-known sorting algorithms is Quicksort. The algorithm works as follows:
- Choose a pivot element from the list.
- Partition the list so that all elements with values less than the pivot come before the pivot and all elements with values greater than the pivot come after the pivot.
- Recursively apply the algorithm to the sublists.
Remark¶
For reasons of simplicity, we will assume that the list contains only distinct elements.
Number of Comparisons¶
During partitioning, the algorithm compares each element in the list with the pivot element. This means for each iteration of the algorithm, the number of comparisons made is $n-1$, where $n$ is the number of elements in the list.
The number of comparisons made by Quicksort affects the efficiency of the algorithm, i.e., the time it takes to sort the list.
In the best case, the pivot element is the median of the list, and the number of comparisons is minimized.
In the worst case, the pivot element is the smallest or largest element in the list, and the number of comparisons is maximized.
Theorem 1: The worst-case number of comparisons made by Quicksort is $\tfrac{(n-1)n}{2}$.¶
Proof:¶
The worst case occurs when the pivot element is the smallest or largest element in the list. In this case, the list is partitioned into two sublists of sizes $n-1$ and $0$. The number of comparisons is then $n-1$ plus the number of comparisons made by Quicksort on the sublists. Therefore, the worst-case number of comparisons made by Quicksort is $$ \sum_{i=0}^{n-1} i = \tfrac{(n-1)n}{2}. $$
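This worst case is easy to reproduce: the sketch below (with a hypothetical `comparisons_first_pivot` helper, not part of the code later in this notebook) counts the comparisons when the pivot is always the first element of an already sorted list.

```python
# Sketch: count comparisons of quicksort with a fixed (first-element) pivot
# on an already sorted list; this triggers the worst case described above.
def comparisons_first_pivot(arr):
    if len(arr) <= 1:
        return 0
    pivot = arr[0]
    left = [x for x in arr[1:] if x < pivot]
    right = [x for x in arr[1:] if x > pivot]
    # len(arr) - 1 comparisons are made during partitioning
    return len(arr) - 1 + comparisons_first_pivot(left) + comparisons_first_pivot(right)

for n in [2, 5, 10, 50]:
    assert comparisons_first_pivot(list(range(n))) == (n - 1) * n // 2
```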
Theorem 2: The best-case number of comparisons made by Quicksort is in $O(n \log n)$.¶
Proof (informal):¶
At the top level we make $n-1$ comparisons. After splitting the list in half, the two sublists together require roughly $2\cdot \tfrac{n}{2}=n$ comparisons. Since the list size is halved in every iteration, about $\log n$ iterations suffice to reach sublists of size $1$ or $0$. Over all iterations this gives $$ \approx n \log n = O(n \log n) $$ comparisons.
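The halving argument can be checked against the exact best-case recurrence for perfectly balanced splits, $C(n) = n - 1 + 2\,C\!\left(\tfrac{n-1}{2}\right)$ for $n = 2^k - 1$ (a sketch with a hypothetical helper name):

```python
# Sketch: best-case recurrence C(n) = n - 1 + 2*C((n-1)/2) for n = 2^k - 1,
# i.e. the pivot is always the median and both sublists have equal size.
def best_case_comparisons(n):
    if n <= 1:
        return 0
    return n - 1 + 2 * best_case_comparisons((n - 1) // 2)

for k in range(1, 11):
    n = 2 ** k - 1
    # at most log2(n+1) = k levels with at most n comparisons each
    assert best_case_comparisons(n) <= n * k
```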
Strategies to Reduce the Number of Comparisons¶
Picking multiple elements (e.g. $3$) and choosing their median as the pivot greatly reduces the number of comparisons needed.
Picking a uniformly distributed random element as the pivot does not reduce the expected number of comparisons averaged over all(!) lists, but it can make the runtimes more "predictable".
Combining both approaches gives very good results.
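A minimal sketch of the median-of-three idea mentioned above, using a hypothetical `median_of_three_pivot` helper: sample three positions uniformly at random and take the index whose value is the median of the three.

```python
import random

# Sketch of the median-of-three strategy: sample three positions uniformly
# at random and use the position of their median value as the pivot index.
def median_of_three_pivot(arr):
    candidates = random.sample(range(len(arr)), 3)
    # order the three sampled indices by their values and take the middle one
    return sorted(candidates, key=lambda i: arr[i])[1]

arr = [9, 1, 7, 3, 5]
pivot_index = median_of_three_pivot(arr)
assert 0 <= pivot_index < len(arr)
```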
Finding the average number of comparisons of Randomized Quicksort¶
Let $n\in\mathbb{N}$, and let $X^{(0)}_\emptyset(n)$ and $X^{(k)}_{s}(n)$ for $s\in\{l, r\}^k$, $k\in\mathbb N$, be random variables that, independently and uniformly at random, select a pivot element from a list of size $n$.
We can now rewrite the Randomized Quicksort algorithm as follows:
- Start with $k=0$ and $s=\emptyset$.
- Choose a pivot element $X^{(k)}_{s}(n)$ from the list of size $n$.
- Partition the list so that all elements with values less than the pivot come before the pivot and all elements with values greater than the pivot come after the pivot.
- If the sublist is empty or has only one element, stop. Otherwise, apply the algorithm to the sublists, incrementing $k$ and appending $l$ or $r$, respectively, to $s$.
Let $C(n)$ be the random variable that represents the number of comparisons made by Randomized Quicksort on a list of size $n$. We want to find the expected value of $C(n)$.
Let $x_{(1)}, \dots, x_{(n)}$ be the elements of the list such that $x_{(1)} < \dots < x_{(n)}$. Let $i, j\in\{1, \dots, n\}$ with $i < j$ and let $C_{i, j}$ be the random variable that is $1$ if $x_{(i)}$ is compared to $x_{(j)}$ and $0$ otherwise. Then we have $$ C(n) = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} C_{i, j}. $$
Remark¶
$C_{i, j}$ is $1$ if and only if either $x_{(i)}$ or $x_{(j)}$ is chosen as the pivot element before any of the $j - i - 1$ elements $x_{(i+1)}, \dots, x_{(j-1)}$, because picking any of the elements in between would separate $x_{(i)}$ and $x_{(j)}$.
It follows by induction that the probability that $x_{(i)}$ is compared to $x_{(j)}$ is independent of the size of the list and we get $$ \mathbb E\left[C_{i, j}\right] = \mathbb P(C_{i, j} = 1) = \tfrac{2}{j - i + 1}. $$
Proof:¶
We look at the list $x_{(i)}, \dots, x_{(j)}$. The probability that $x_{(i)}$ or $x_{(j)}$ is chosen as the pivot element is $\tfrac{2}{j - i + 1}$.
Now we assume that for a list of size $n\in\mathbb N$ that contains the elements $x_{(i)}, \dots, x_{(j)}$, the probability that $x_{(i)}$ is compared to $x_{(j)}$ is $\tfrac{2}{j - i + 1}$.
We will write $x'_{(1)}, \dots, x'_{(n)}$ for the list, where $x'_{(1)}=x_{(i)}$, …, $x'_{(j-i+1)}=x_{(j)}$ and $x'_{(j-i+2)}$, …, $x'_{(n)}$ are the remaining elements of the list.
Let $X$ be the random variable that selects a uniformly distributed random element from a list of size $n+1$. If we now add another element $x'_{(n+1)}$ to this list, the probability that $x_{(i)}$ is compared to $x_{(j)}$ is
$$ \begin{aligned} \mathbb P\left[\left\{C_{i, j} = 1\right\}\right] &= \sum_{k=1}^{n+1} \mathbb P\left[\left\{X = x'_{(k)}\right\}\right] \cdot \mathbb P\left[\left\{C_{i, j} = 1\right\} \middle| \left\{X = x'_{(k)}\right\}\right] \\ &= \frac{2}{n+1} + \sum_{k=j-i+2}^{n+1} \frac{1}{n+1} \cdot \mathbb P\left[\left\{C_{i, j} = 1\right\} \middle| \left\{X = x'_{(k)}\right\}\right] \\ &= \frac{2}{n+1} + \frac{1}{n+1} \sum_{k=j-i+2}^{n+1} \frac{2}{j-i+1} \\ &= \frac{2}{n+1} \cdot \left(1 + \frac{n+1-(j-i+2)+1}{j-i+1}\right) \\ &= \frac{2}{n+1} \cdot \frac{n+1}{j-i+1} \\ &= \frac{2}{j-i+1}. \end{aligned} $$
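The formula can also be checked by simulation: the sketch below (with a hypothetical helper `compared`) follows the pivot choices restricted to the ranks $lo, \dots, hi$ that still contain both $i$ and $j$, and reports whether the elements of ranks $i$ and $j$ end up being compared.

```python
import random

# Monte Carlo check of P(C_{i,j} = 1) = 2/(j - i + 1): pick a uniform pivot
# rank and recurse into the interval that still contains both i and j.
def compared(lo, hi, i, j):
    # the current sublist holds ranks lo..hi; i and j are both inside it
    pivot = random.randint(lo, hi)
    if pivot == i or pivot == j:
        return True            # i and j are compared with each other
    if i < pivot < j:
        return False           # the pivot separates i and j forever
    # both i and j fall into the same sublist; recurse into it
    if pivot < i:
        return compared(pivot + 1, hi, i, j)
    return compared(lo, pivot - 1, i, j)

n, i, j, trials = 20, 3, 10, 100_000
estimate = sum(compared(1, n, i, j) for _ in range(trials)) / trials
assert abs(estimate - 2 / (j - i + 1)) < 0.01   # 2/8 = 0.25
```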
Lemma¶
Let $n\in\mathbb N$. Then $$ \sum_{i=1}^{n-1} \sum_{j=2}^{n+1-i} \tfrac{1}{j} = \sum_{j=2}^{n} \sum_{i=1}^{n-j+1} \tfrac{1}{j}. $$
Proof:¶
Let $n=1$. Then $$ \begin{aligned} \sum_{i=1}^{1-1} \sum_{j=2}^{1+1-i} \tfrac{1}{j} &= 0, \\ \sum_{j=2}^{1} \sum_{i=1}^{1-j+1} \tfrac{1}{j} &= 0. \end{aligned} $$
Now assume the identity holds for some $n\in\mathbb N$. Then, for $n+1$, $$ \begin{aligned} \sum_{i=1}^{n} \sum_{j=2}^{n+2-i} \tfrac{1}{j} &= \sum_{i=1}^{n-1} \sum_{j=2}^{n+2-i} \tfrac{1}{j} + \sum_{j=2}^{n+2-n} \tfrac{1}{j} \\ &= \sum_{i=1}^{n-1} \sum_{j=2}^{n+2-i} \tfrac{1}{j} + \tfrac{1}{2} \\ &= \sum_{i=1}^{n-1}\left(\frac{1}{n+2-i} + \sum_{j=2}^{n+1-i} \tfrac{1}{j}\right) + \tfrac{1}{2} \\ &= \sum_{i=1}^{n-1}\frac{1}{n+2-i} + \sum_{i=1}^{n-1} \sum_{j=2}^{n+1-i} \tfrac{1}{j} + \tfrac{1}{2} \\ &\rlap{= \sum_{i=3}^{n+1}\frac{1}{i} + \sum_{j=2}^{n} \sum_{i=1}^{n-j+1} \tfrac{1}{j} + \tfrac{1}{2}}\hphantom{= \sum_{i=2}^{n+1}\frac{1}{i} + \sum_{j=2}^{n} \sum_{i=1}^{n+1-j+1} \tfrac{1}{j} - \sum_{j=2}^{n} \frac{1}{n-j+2}} \end{aligned} $$
$$ \begin{aligned} \hphantom{\sum_{i=1}^{n} \sum_{j=2}^{n+2-i} \tfrac{1}{j} }&= \sum_{i=2}^{n+1}\frac{1}{i} + \sum_{j=2}^{n} \sum_{i=1}^{n-j+1} \tfrac{1}{j} \\ &= \sum_{i=2}^{n+1}\frac{1}{i} + \sum_{j=2}^{n}\left( \sum_{i=1}^{n+1-j+1} \tfrac{1}{j} - \frac{1}{n-j+2} \right) \\ &= \sum_{i=2}^{n+1}\frac{1}{i} + \sum_{j=2}^{n} \sum_{i=1}^{n+1-j+1} \tfrac{1}{j} - \sum_{j=2}^{n} \frac{1}{n-j+2} \\ &= \sum_{i=2}^{n+1}\frac{1}{i} + \sum_{j=2}^{n} \sum_{i=1}^{n+1-j+1} \tfrac{1}{j} - \sum_{j=2}^{n} \frac{1}{j} \\ &= \frac{1}{n+1} + \sum_{j=2}^{n} \sum_{i=1}^{n+1-j+1} \tfrac{1}{j} \\ &= \sum_{j=2}^{n+1} \sum_{i=1}^{n+1-j+1} \tfrac{1}{j} \end{aligned} $$
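The index swap in the lemma can also be verified exactly for small $n$, a quick sketch using exact rational arithmetic:

```python
from fractions import Fraction

# Exact check of the index swap in the lemma for several n.
for n in range(1, 20):
    lhs = sum(Fraction(1, j) for i in range(1, n) for j in range(2, n + 2 - i))
    rhs = sum(Fraction(1, j) for j in range(2, n + 1) for i in range(1, n - j + 2))
    assert lhs == rhs
```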
Theorem 3: Expected number of comparisons made by Randomized Quicksort¶
The expected number of comparisons made by Randomized Quicksort on a list of size $n$ is $$ \mathbb E[C(n)] = (2n + 2)H_n - 4n $$ where $H_n = \sum_{i=1}^{n} \tfrac{1}{i}$ is the $n$-th harmonic number.
Using that $H_n=O(\log n)$ [3] we get $$ \mathbb E[C(n)] = O(n\log n). $$
Proof:¶
$$ \begin{aligned} \mathbb E[C(n)] &= \mathbb E\left[\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} C_{i, j}\right] \\ &= \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \mathbb E[C_{i, j}] \\ &= \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \tfrac{2}{j - i + 1} \\ &\rlap{= \sum_{i=1}^{n-1}\sum_{j=2}^{n+1-i} \tfrac{2}{j}} \hphantom{= 2(n + 1)\sum_{j=2}^{n} \tfrac{1}{j} - 2(n - 1)} \end{aligned} $$
$$ \begin{aligned} \hphantom{\mathbb E[C(n)]}&= \sum_{j=2}^{n} \sum_{i=1}^{n-j+1} \tfrac{2}{j} \\ &= \sum_{j=2}^{n} \tfrac{2(n-j+1)}{j} \\ &= 2(n + 1)\sum_{j=2}^{n} \tfrac{1}{j} - 2\sum_{j=2}^{n} 1 \\ &= 2(n + 1)\sum_{j=2}^{n} \tfrac{1}{j} - 2(n - 1) \\ &= 2(n + 1)\sum_{j=1}^{n} \tfrac{1}{j} - 4n \\ &= 2(n + 1)H_n - 4n \\ &= (2n + 2)H_n - 4n. \end{aligned} $$
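The closed form can be cross-checked numerically (a sketch): $\mathbb E[C(n)]$ satisfies the standard recurrence $\mathbb E[C(n)] = n - 1 + \tfrac{2}{n}\sum_{k=0}^{n-1}\mathbb E[C(k)]$, which we compare against $(2n+2)H_n - 4n$ with exact rational arithmetic.

```python
from fractions import Fraction

# Check E[C(n)] = (2n + 2) H_n - 4n against the recurrence
# E[C(n)] = n - 1 + (2/n) * sum_{k=0}^{n-1} E[C(k)], using exact fractions.
N = 30
expected = [Fraction(0), Fraction(0)]          # E[C(0)] = E[C(1)] = 0
for n in range(2, N + 1):
    expected.append(n - 1 + Fraction(2, n) * sum(expected))
for n in range(1, N + 1):
    H_n = sum(Fraction(1, i) for i in range(1, n + 1))
    assert expected[n] == (2 * n + 2) * H_n - 4 * n
```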
Wait a moment! ... or two?¶
Higher Moments of Randomized Quicksort¶
Let $n\in\mathbb N_0$, $k\in\mathbb N_0$ and $s\in\{l, r\}^k$.
If $n\leq 1$, we define $C^{(k)}_s(n) \defeq 0$. Otherwise, $$ \begin{aligned} C^{(k)}_s(n) &\defeq n - 1 + C^{(k+1)}_{s\times\{l\}}\left(X^{(k)}_{s}(n) - 1\right) + C^{(k+1)}_{s\times\{r\}}\left(n - X^{(k)}_{s}(n)\right). \\ \end{aligned} $$
Lemma:¶
For all $n, k, m\in\mathbb N_0$ and $s\in\{l, r\}^k$, $t\in\{l, r\}^m$, $C^{(k)}_s(n)$ and $C^{(m)}_t(n)$ are identically distributed.
Lemma:¶
For all $n, k\in\mathbb N_0$ and $s\in\{l, r\}^k$, $C^{(k+1)}_{s\times\{l\}}(X^{(k)}_{s}(n) - 1)$ and $C^{(k+1)}_{s\times\{r\}}(n - X^{(k)}_{s}(n))$ are independent.
We can now write the following recursive formula for $C(n)$: $$ \begin{aligned} C(n) = C^{(0)}_{\emptyset}(n) = n - 1 + C^{(1)}_{\{l\}}(X^{(0)}_\emptyset(n) - 1) + C^{(1)}_{\{r\}}(n - X^{(0)}_\emptyset(n)). \end{aligned} $$
The idea is that the number of comparisons made by Randomized Quicksort on a list of size $n$ can be expressed as the sum of the number of comparisons made by Randomized Quicksort on the sublists and the number of comparisons made during partitioning.
Theorem 4: The probability generating function of $C(n)$¶
Let $n\in\mathbb N_0$. The probability generating function of $C(n)$ is $$ g_n(t) \defeq g_{C(n)}(t) = \begin{cases} 1 & \text{if } n\leq 1, \\ t^{n-1} + \tfrac{1}{n}\sum_{k=1}^{n} g_{k-1}(t) \cdot g_{n-k}(t) & \text{otherwise}. \end{cases} $$
Proof:¶
Let $n\in\mathbb N_0$. If $n\leq 1$, we have $$ \begin{aligned} g_n(t) = \mathbb E[t^{0}] = 1. \end{aligned} $$ Otherwise, we have $$ \begin{aligned} g_n(t) &= \mathbb E\left[t^{n-1 + C^{(1)}_{\{l\}}(X^{(0)}_\emptyset(n) - 1) + C^{(1)}_{\{r\}}(n - X^{(0)}_\emptyset(n))}\right] \\ &\rlap{= \mathbb E\left[t^{n-1} \cdot t^{C^{(1)}_{\{l\}}(X^{(0)}_\emptyset(n) - 1)} \cdot t^{C^{(1)}_{\{r\}}(n - X^{(0)}_\emptyset(n))}\right]}\hphantom{= t^{n-1} \cdot \sum_{k=1}^{n} \mathbb P\left[\left\{X^{(0)}_\emptyset(n) = k\right\}\right] \cdot \mathbb E\left[t^{C^{(1)}_{\{l\}}(k - 1)} \cdot t^{C^{(1)}_{\{r\}}(n - k)}\right]} \end{aligned} $$
$$ \begin{aligned} &= t^{n-1} \cdot\mathbb E\left[ t^{C^{(1)}_{\{l\}}(X^{(0)}_\emptyset(n) - 1)} \cdot t^{C^{(1)}_{\{r\}}(n - X^{(0)}_\emptyset(n))}\right]\\ \hphantom{g_n(t)}&= t^{n-1} \cdot \sum_{k=1}^{n} \mathbb P\left[\left\{X^{(0)}_\emptyset(n) = k\right\}\right] \cdot \mathbb E\left[t^{C^{(1)}_{\{l\}}(k - 1)} \cdot t^{C^{(1)}_{\{r\}}(n - k)}\right] \\ \hphantom{g_n(t)}&= t^{n-1} \cdot \sum_{k=1}^{n} \tfrac{1}{n} \cdot \mathbb E\left[t^{C^{(1)}_{\{l\}}(k - 1)}\right]\mathbb E\left[ t^{C^{(1)}_{\{r\}}(n - k)}\right] \\ \hphantom{g_n(t)}&= t^{n-1} \cdot \tfrac{1}{n} \sum_{k=1}^{n} \mathbb E\left[t^{C(k - 1)}\right]\mathbb E\left[ t^{C(n - k)}\right] \\ &= t^{n-1} \cdot \tfrac{1}{n}\sum_{k=1}^{n} g_{k-1}(t) \cdot g_{n-k}(t). \end{aligned} $$
We can now use the probability generating function to calculate the higher moments of $C(n)$, since $$ \begin{aligned} \mathbb E[C(n)^k] &= \left(t\frac{\partial}{\partial t}\right)^{k}g_n(t)\bigg|_{t=1},\\ \mathbb E\left[\left(C(n) - \mathbb E[C(n)]\right)^k\right] &= \left(t\frac{\partial}{\partial t}\right)^{k}\left(\frac{g_n(t)}{t^{\mathbb E[C(n)]}}\right)\bigg|_{t=1}. \end{aligned} $$
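As a small sanity check of the first formula: by the recursion of Theorem 4, $g_3(t) = \tfrac{2t^3 + t^2}{3}$, and applying the operator $t\frac{\partial}{\partial t}$ once recovers $\mathbb E[C(3)] = (2\cdot 3+2)H_3 - 4\cdot 3 = \tfrac{8}{3}$, matching Theorem 3.

```python
from sympy import symbols, diff, Rational

# Apply the operator (t * d/dt) once to g_3(t) = (2t^3 + t^2)/3 and
# evaluate at t = 1 to recover the first moment E[C(3)] = 8/3.
t = symbols('t')
g3 = (2 * t**3 + t**2) / 3
first_moment = (t * diff(g3, t)).subs(t, 1)
assert first_moment == Rational(8, 3)
```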
Definition: Generalized Harmonic Numbers¶
Let $n\in\mathbb N$ and $k\in\mathbb N$. The $n$-th generalized harmonic number of order $k$ is $$ H_{n, k} \defeq \sum_{i=1}^{n} \tfrac{1}{i^k}. $$
With our newly acquired knowledge, we can try to find explicit expressions for the moments of $C(n)$. Let $k\in\mathbb N$. [2] claims that the $k$-th moment of $C(n)$ has an explicit expression given by a polynomial in $n$, $H_{n, 1}$, …, $H_{n, k}$. No proof is given in [2], nor is there any claim about the degree of the polynomial. The claim is, however, supported by the fact that the first eight moments of $C(n)$ have been computed and are given in [2].
from sympy import symbols, Function, solve, summation, Rational, diff, expand, collect
g_functions_example = {0: 1, 1: 1}  # g_0(t) = g_1(t) = 1
t = symbols('t', integer=True)
def g_recursive_example(n):
    # Compute g_n(t) via the recursion of Theorem 4, memoizing the results.
    if n in g_functions_example:
        return g_functions_example[n]
    g = expand(t ** (n - 1) * Rational(1, n) * sum(g_recursive_example(k - 1) * g_recursive_example(n - k) for k in range(1, n + 1)))
    g_functions_example[n] = g
    return collect(g, t)
g_recursive_example(6)
g_functions_example
{0: 1,
 1: 1,
 2: t,
 3: 2*t**3/3 + t**2/3,
 4: t**6/3 + t**5/6 + t**4/2,
 5: 2*t**10/15 + t**9/15 + t**8/5 + 4*t**7/15 + t**6/3,
 6: 2*t**15/45 + t**14/45 + t**13/15 + 4*t**12/45 + 2*t**11/9 + t**10/18 + 7*t**9/18 + t**8/9}
def pgf2moment(g, v, k):
    # Apply the operator (v * d/dv) k times, then evaluate at v = 1.
    for _ in range(k):
        g = v * diff(g, v, 1)
    return g.subs(v, 1)
# Note: `f` (an enumerator of exponent tuples), the symbol `n`, `harmonic_number`
# and `g_functions` are defined in earlier cells that are not shown here;
# `prod` is sympy's product over symbolic factors.
def compute_moment(p, order, centred):
    # Number of monomials; we need at least this many equations to solve for the coefficients
    amount = len(list(f(range(order + 1), order, p + 1)))
    coefficients = symbols(' '.join([f'c{i}' for i in range(amount)]))
    harmonic_symbols = [Function(f'H{i}', integer=True) for i in range(1, p + 1)]
    # Construct the polynomial ansatz in n, H_{n,1}, ..., H_{n,p} with unknown coefficients
    polynomial = sum(
        coefficients[i] * prod(h(n) ** e if h != n else n ** e for h, e in zip([n] + harmonic_symbols, powers))
        for i, powers in enumerate(f(range(order + 1), order, p + 1))
    )
    # Moments of C(1), ..., C(amount + 1), computed from the probability generating functions
    if centred:
        momentValues = [pgf2moment(g_functions[i + 1] / (t ** (diff(g_functions[i + 1], t, 1).subs(t, 1))), t, p) for i in range(amount + 1)]
    else:
        momentValues = [pgf2moment(g_functions[i + 1], t, p) for i in range(amount + 1)]
    # Match the ansatz against the computed moments to obtain a linear system
    equations = [
        polynomial.subs({n: i, **{harmonic_symbols[l](n): harmonic_number(i, l + 1) for l in range(len(harmonic_symbols))}})
        - momentValues[i - 1]
        for i in range(1, amount + 2)
    ]
    solution = solve(equations, coefficients)
    return (polynomial.subs(solution), len(solution) != 0)
compute_moment(1, 2, False)[0]
compute_moment(2, 3, True)[0]
centred_moments[3]
def search_moment(p, centred):
    # Increase the ansatz order until compute_moment finds a consistent solution.
    found = False
    order = p
    while not found:
        order += 1
        moment, found = compute_moment(p, order, centred)
    return moment
import random as r
def quicksort(arr, random=True):
    # Sort `arr` and count comparisons; random=False always picks the first element as pivot.
    if len(arr) <= 1:
        return (arr, 0)
    pivot = r.randrange(len(arr)) if random else 0
    piv_elem = arr[pivot]
    left = []
    right = []
    for i in range(len(arr)):
        if i == pivot:
            continue
        if arr[i] < piv_elem:
            left.append(arr[i])
        else:
            right.append(arr[i])
    left, comp_left = quicksort(left, random)    # propagate the pivot strategy
    right, comp_right = quicksort(right, random)
    return (left + [piv_elem] + right, len(arr) - 1 + comp_left + comp_right)
quicksort([3,2,1,5])
([1, 2, 3, 5], 4)
import time
from scipy.stats import moment
def measureSort(l, f):
    # Shuffle the list, sort it with `f`, and return (comparisons, runtime in ns).
    r.shuffle(l)
    start_time = time.time_ns()
    comp = f(l)[1]
    end_time = time.time_ns()
    execution_time = end_time - start_time
    return (int(comp), float(execution_time))
def computeMoment(f, k=1, centred=True, iterations=1000):
    # center=None centres about the sample mean; center=0.0 gives the raw moment.
    return moment([f() for i in range(iterations)], order=k, center=None if centred else 0.0)
compareMoment(30, order=1, centred=False, iterations = 1000)
compareMoment(30, order=2, centred=True, iterations=5000)
compareMoment(30, order=3, centred=True, iterations=20000)
Probability Distribution of the Number of Comparisons¶
Given the probability generating function of $C(n)$, we can now also compute the probability distribution of $C(n)$ for a given $n\in\mathbb N$. Let $k\in\mathbb N_0$. Then we have
$$ \mathbb P\left[\left\{C(n) = k\right\}\right] = \frac{1}{k!}\frac{\partial^k g_n(t)}{\partial t^k}\bigg|_{t=0}. $$
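Since $g_n$ is a polynomial, these derivatives simply read off its coefficients. A small sketch that rebuilds $g_4(t)$ from the recursion of Theorem 4 and extracts the distribution of $C(4)$:

```python
from sympy import symbols, Poly, Rational, expand

# Build g_n(t) by the recursion of Theorem 4 and read off the distribution
# of C(4) as the coefficients of the polynomial g_4(t).
t = symbols('t')
g = {0: 1, 1: 1}
for n in range(2, 5):
    g[n] = expand(t**(n - 1) * Rational(1, n)
                  * sum(g[k - 1] * g[n - k] for k in range(1, n + 1)))
probabilities = Poly(g[4], t).all_coeffs()[::-1]   # index k -> P[C(4) = k]
assert sum(probabilities) == 1
assert probabilities[4] == Rational(1, 2)          # P[C(4) = 4] = 1/2
```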
createAnimation(30)
createAnimation(1000, 10)
Conclusion¶
We have seen that the expected number of comparisons made by Randomized Quicksort is of the same order as the best case, so the algorithm performs well on average.
It is hard to conclude anything concrete about the runtime since the swapping of elements is not considered in this analysis. However, the number of comparisons is a good indicator of the runtime of the algorithm.
A similar analysis could be done for the number of swaps made by Randomized Quicksort, as in [1].